Topic Modelling on PubMed Abstracts using BERTopic¶
In this notebook, we'll use BERTopic to model topics on the same corpus of PubMed abstracts that we modelled with Latent Dirichlet Allocation in the previous exercise (https://github.com/eukairos/topic-models/blob/main/PubMed_LDA_5K.ipynb). Because BERTopic is modular, we can also run two models side by side and compare them. Along the way, we'll encounter some new concepts as well.
# first check that GPU is available
import torch
torch.cuda.is_available()
True
# instantiate the embedding model
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer('neuml/pubmedbert-base-embeddings', device='cuda')
Load data¶
import pandas as pd

data = pd.read_csv('pubmed_abstracts.csv')
to_drop = ['Title', 'pmid', 'meshMajor', 'meshid',
           'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'L', 'M', 'N', 'Z']
data = data.drop(to_drop, axis=1)
data = data.sample(n=5000, random_state=42)
data = data.reset_index(drop=True)
data['abstractText'] = data['abstractText'].str.lower()
Do a bit of cleaning by removing numbers¶
import re

def remove_numbers(series):
    def rem_no(text):
        # match whole numbers, optionally followed by 1-2 decimal places
        # (note the escaped dot; an unescaped '.' would match any character)
        pattern = r'\b\d+(\.\d{1,2})?\b'
        cleaned_text = re.sub(pattern, '', text)
        cleaned_text = cleaned_text.strip()
        return cleaned_text
    return series.apply(rem_no)
data['no_numbers'] = remove_numbers(data['abstractText'])
abstracts = data['no_numbers'].to_list()
# generate the embeddings
embeddings = embed_model.encode(abstracts)
The best combination of parameters was discovered in a separate session (refer to https://github.com/eukairos/topic-models/blob/main/Best%20UMAP%20n%20HDBSCAN%20hyperparameters%20for%20PubMed.ipynb):
UMAP: n_components = 20, n_neighbors = 10
HDBSCAN: min_cluster_size = 25, min_samples = 10
We'll fit these into our BERTopic pipeline.
Dimensionality Reduction¶
from bertopic import BERTopic
from umap import UMAP
reducer = UMAP(
n_components=20,
n_neighbors=10,
min_dist=0.0,
metric="cosine",
random_state=42,
)
reduced = reducer.fit_transform(embeddings)
Clustering¶
What is clustering in BERTopic's context? Essentially, similar documents (or more specifically, their embeddings) are grouped into clusters, and the documents in each cluster are then concatenated into a single 'superdocument'. The topic-representation step that follows works on these superdocuments rather than on individual documents.
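The superdocument step can be sketched in plain Python. The documents and cluster labels below are invented for illustration; label -1 is HDBSCAN's noise/outlier bucket, which BERTopic excludes from topics.

```python
# Toy sketch: turning cluster labels into "superdocuments" for c-TF-IDF.
# The docs and labels here are hypothetical; -1 marks outliers (HDBSCAN noise).
docs = ["heart rate study", "cardiac output measured",
        "soil water content", "groundwater sampling", "misc note"]
labels = [0, 0, 1, 1, -1]

grouped = {}
for doc, label in zip(docs, labels):
    if label == -1:        # outliers are dropped, not assigned to a topic
        continue
    grouped.setdefault(label, []).append(doc)

# one concatenated "superdocument" per cluster
superdocs = {k: " ".join(v) for k, v in grouped.items()}
print(superdocs)
```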
from hdbscan import HDBSCAN
clusterer = HDBSCAN(
min_cluster_size=25,
min_samples=5,
cluster_selection_method='eom',
metric="euclidean",
gen_min_span_tree=True,
).fit(reduced)
Representation Model¶
By 'representation', BERTopic means how each topic is represented as keywords. The default process is:
- Create a bag-of-words for each superdocument using CountVectorizer;
- Compute a class-based TF-IDF over the superdocuments (hence the name 'c-TF-IDF'). That is, it calculates the importance of each word in a superdocument relative to the other clusters, thereby identifying the most representative keywords for each topic.
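A minimal numeric sketch of that weighting, using one common form of the c-TF-IDF formula, W[t, c] = tf[t, c] · log(1 + A / f[t]). The toy term-frequency matrix is invented, and this omits the bm25_weighting variant we enable below:

```python
import numpy as np

# rows = term frequencies per superdocument (class), columns = vocabulary.
# W[t, c] = tf[t, c] * log(1 + A / f[t]), where A is the average number of
# words per class and f[t] is the frequency of term t across all classes.
tf = np.array([
    [5, 0, 1],   # class 0: term 0 dominates
    [0, 4, 1],   # class 1: term 1 dominates
], dtype=float)

A = tf.sum() / tf.shape[0]   # average words per class
f = tf.sum(axis=0)           # per-term frequency across all classes
ctfidf = tf * np.log(1 + A / f)

# In each class, the class-specific term outscores the shared term (column 2).
print(ctfidf.round(3))
```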
BERTopic's documentation also suggests delaying the CountVectorizer step until after training the model. The advantage is that we can then fine-tune the representation with .update_topics(), tweaking topic representations without having to re-train our models. That is what we do in this exercise. The cell below instantiates the representation models.
# Instantiate the BERTopic models.
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True, reduce_frequent_words=True)
# Model 1: Default (ctfidf only)
topic_model = BERTopic(
embedding_model=embed_model,
umap_model = reducer,
hdbscan_model = clusterer,
ctfidf_model = ctfidf_model)
# Model 2: KeyBERT Representation
representation_model_keybert = KeyBERTInspired()
topic_model_keybert = BERTopic(
embedding_model = embed_model,
umap_model = reducer,
hdbscan_model = clusterer,
ctfidf_model = ctfidf_model,
representation_model = representation_model_keybert
)
While the default representation model is based on clustered TF-IDF (c-TF-IDF), KeyBERTInspired is an alternative that selects topic keywords by semantic similarity rather than statistical frequency. It takes the candidate keywords from the documents in a topic, computes their embeddings, computes a centroid embedding for the topic (the average of all document embeddings in that topic), and reranks the keywords by cosine similarity between their embeddings and the topic centroid. In this way, it captures semantic coherence and handles synonyms better. It is also less sensitive to document length, whereas c-TF-IDF, like TF-IDF, is length-sensitive.
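That reranking step can be sketched with toy numbers. The 2-d "embeddings" below are invented for illustration; real sentence-transformer embeddings are hundreds of dimensions:

```python
import numpy as np

# Hypothetical document embeddings for one topic; the centroid is their mean.
doc_embeddings = np.array([[0.9, 0.1], [0.8, 0.2], [1.0, 0.0]])
centroid = doc_embeddings.mean(axis=0)

# Hypothetical candidate keywords with made-up embeddings.
keywords = {"insulin": np.array([0.95, 0.05]),
            "study":   np.array([0.30, 0.90]),
            "glucose": np.array([0.85, 0.15])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Rerank candidates by cosine similarity to the topic centroid.
ranked = sorted(keywords, key=lambda w: cosine(keywords[w], centroid),
                reverse=True)
print(ranked)  # semantically central keywords come first
```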
Other representation models include MaximalMarginalRelevance, which diversifies the keywords within a topic so that near-duplicate words don't crowd out more informative ones. You may recall from our LDA exercise that the word "cell" kept popping up in different topics.
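The MMR selection rule can be sketched in a few lines. The relevance and similarity numbers below are invented; lam trades off relevance to the topic against redundancy with already-selected keywords:

```python
# Minimal Maximal-Marginal-Relevance sketch with made-up similarity scores.
# At each step, pick the candidate that maximizes
#   lam * relevance[i] - (1 - lam) * max similarity to already-selected words.
def mmr(relevance, similarity, lam=0.5, k=2):
    candidates = list(range(len(relevance)))
    selected = [max(candidates, key=lambda i: relevance[i])]
    candidates.remove(selected[0])
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda i: lam * relevance[i]
                   - (1 - lam) * max(similarity[i][j] for j in selected))
        selected.append(best)
        candidates.remove(best)
    return selected

relevance = [0.9, 0.85, 0.5]        # e.g. "cell", "cells", "migration"
similarity = [[1.0, 0.95, 0.2],     # "cells" is nearly identical to "cell"
              [0.95, 1.0, 0.2],
              [0.2, 0.2, 1.0]]
print(mmr(relevance, similarity, lam=0.5, k=2))  # picks index 0, then 2
```

The near-duplicate candidate (index 1) loses to the less relevant but more diverse one (index 2), which is exactly the behaviour that keeps a topic's keyword list varied.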
It is also possible to chain more than one representation model together. You can read about it in the BERTopic documentation (https://maartengr.github.io/BERTopic/getting_started/representation/representation.html#chain-models).
# fit the models on the embeddings
topics_1, probs_1 = topic_model.fit_transform(abstracts, embeddings)
topics_2, probs_2 = topic_model_keybert.fit_transform(abstracts, embeddings)
# check outputs of models to see if they make sense
topic_model.get_topic_info()[1:11]
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 1 | 0 | 214 | 0_isolates_vaccine_virus_cattle | [isolates, vaccine, virus, cattle, strains, in... | [chronic wasting disease (cwd) is a fatal prio... |
| 2 | 1 | 132 | 1_visual_task_movement_walking | [visual, task, movement, walking, stimulus, sp... | [objectives: noise often has detrimental effec... |
| 3 | 2 | 121 | 2_images_image_imaging_measurements | [images, image, imaging, measurements, phantom... | [background: endovascular aortic procedures ha... |
| 4 | 3 | 116 | 3_soil_water_wastewater_concentrations | [soil, water, wastewater, concentrations, orga... | [quinclorac, a highly selective auxin herbicid... |
| 5 | 4 | 115 | 4_gastric_pylori_laparoscopic_fecal | [gastric, pylori, laparoscopic, fecal, cholecy... | [the surgical standard for ulcerative colitis ... |
| 6 | 5 | 113 | 5_strains_oil_enzyme_cga | [strains, oil, enzyme, cga, extract, lipase, p... | [oleaginous fungi are of special interest amon... |
| 7 | 6 | 108 | 6_recurrence_lymph_node_pet | [recurrence, lymph, node, pet, nodes, survival... | [background: no agreement has been made about ... |
| 8 | 7 | 106 | 7_nk_cd4_leukemia_il | [nk, cd4, leukemia, il, aml, cells, ifn, cd8, ... | [neoplastic disorders sometimes accompany a re... |
| 9 | 8 | 105 | 8_ventricular_aortic_myocardial_heart | [ventricular, aortic, myocardial, heart, cardi... | [background: we studied the effects of diabete... |
| 10 | 9 | 97 | 9_eyes_lens_cataract_corneal | [eyes, lens, cataract, corneal, glaucoma, intr... | [aims: to quantify the rates of eye preservati... |
topic_model_keybert.get_topic_info()[1:11]
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 1 | 0 | 214 | 0_sera_sj26_antibody_immunization | [sera, sj26, antibody, immunization, rotavirus... | [chronic wasting disease (cwd) is a fatal prio... |
| 2 | 1 | 132 | 1_information_ankle_verbal_feedback | [information, ankle, verbal, feedback, observe... | [objectives: noise often has detrimental effec... |
| 3 | 2 | 121 | 2_reconstruction_cm2_breast_lesions | [reconstruction, cm2, breast, lesions, reconst... | [background: endovascular aortic procedures ha... |
| 4 | 3 | 116 | 3_water_lafeo3_groundwater_efficiency | [water, lafeo3, groundwater, efficiency, matte... | [quinclorac, a highly selective auxin herbicid... |
| 5 | 4 | 115 | 4_cholecystitis_appendicitis_esophageal_laparo... | [cholecystitis, appendicitis, esophageal, lapa... | [the surgical standard for ulcerative colitis ... |
| 6 | 5 | 113 | 5_yeast_indol_enzymes_amino | [yeast, indol, enzymes, amino, enzyme, acrylam... | [oleaginous fungi are of special interest amon... |
| 7 | 6 | 108 | 6_prognosis_prognostic_toxicity_follow | [prognosis, prognostic, toxicity, follow, rare... | [background: no agreement has been made about ... |
| 8 | 7 | 106 | 7_amh_mtb_tuberculosis_leflunomide | [amh, mtb, tuberculosis, leflunomide, cgat, gl... | [neoplastic disorders sometimes accompany a re... |
| 9 | 8 | 105 | 8_aneurysm_aortic_echocardiography_aaa | [aneurysm, aortic, echocardiography, aaa, metf... | [background: we studied the effects of diabete... |
| 10 | 9 | 97 | 9_conjunctival_glaucoma_rnn_glaucomatous | [conjunctival, glaucoma, rnn, glaucomatous, uv... | [aims: to quantify the rates of eye preservati... |
Updating Topics¶
BERTopic suggests removing stopwords at this step, leveraging sklearn CountVectorizer's built-in stop_words parameter. As in our previous exercise, we also add our custom PubMed stopwords to the native English stopwords.
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
# take the first (only) column of the CSV as a list; passing the raw DataFrame
# to union() would iterate over its column names instead of the stopwords
pubmed_stopwords = pd.read_csv('pubmed_stopwords.csv').iloc[:, 0].tolist()
combined_stopwords = list(ENGLISH_STOP_WORDS.union(pubmed_stopwords))
vectorizer_model = CountVectorizer(
stop_words = combined_stopwords,
ngram_range = (1,3),
min_df = 10)
topic_model.update_topics(abstracts, vectorizer_model=vectorizer_model)
topic_model_keybert.update_topics(abstracts, vectorizer_model=vectorizer_model,
representation_model=representation_model_keybert)
Let's see whether our topics have changed after updating.
topic_model.get_topic_info()[1:11]
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 1 | 0 | 214 | 0_isolates_virus_strains_infection | [isolates, virus, strains, infection, samples,... | [chronic wasting disease (cwd) is a fatal prio... |
| 2 | 1 | 132 | 1_visual_children_task_movement | [visual, children, task, movement, memory, sti... | [objectives: noise often has detrimental effec... |
| 3 | 2 | 121 | 2_images_imaging_image_measurements | [images, imaging, image, measurements, method,... | [background: endovascular aortic procedures ha... |
| 4 | 3 | 116 | 3_water_concentrations_organic_concentration | [water, concentrations, organic, concentration... | [quinclorac, a highly selective auxin herbicid... |
| 5 | 4 | 115 | 4_gastric_patients_surgical_surgery | [gastric, patients, surgical, surgery, underwe... | [the surgical standard for ulcerative colitis ... |
| 6 | 5 | 113 | 5_activity_strains_acid_enzyme | [activity, strains, acid, enzyme, ph, compound... | [oleaginous fungi are of special interest amon... |
| 7 | 6 | 108 | 6_recurrence_patients_survival_tumor | [recurrence, patients, survival, tumor, node, ... | [background: no agreement has been made about ... |
| 8 | 7 | 106 | 7_cells_il_cell_cd4 | [cells, il, cell, cd4, anti, patients, antigen... | [neoplastic disorders sometimes accompany a re... |
| 9 | 8 | 105 | 8_ventricular_aortic_patients_heart | [ventricular, aortic, patients, heart, cardiac... | [background: we studied the effects of diabete... |
| 10 | 9 | 97 | 9_visual_eye_surgery_disc | [visual, eye, surgery, disc, vision, patients,... | [aims: to quantify the rates of eye preservati... |
topic_model_keybert.get_topic_info()[1:11]
| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 1 | 0 | 214 | 0_isolates_isolate_pigs_virus | [isolates, isolate, pigs, virus, strain, patho... | [chronic wasting disease (cwd) is a fatal prio... |
| 2 | 1 | 132 | 1_processing_task_memory_performance | [processing, task, memory, performance, discri... | [objectives: noise often has detrimental effec... |
| 3 | 2 | 121 | 2_ultrasound_imaging_image_real time | [ultrasound, imaging, image, real time, images... | [background: endovascular aortic procedures ha... |
| 4 | 3 | 116 | 3_water_environmental_removal_environment | [water, environmental, removal, environment, a... | [quinclorac, a highly selective auxin herbicid... |
| 5 | 4 | 115 | 4_gastric_resection_surgery_surgical | [gastric, resection, surgery, surgical, operat... | [the surgical standard for ulcerative colitis ... |
| 6 | 5 | 113 | 5_yeast_enzymes_enzyme_fungi | [yeast, enzymes, enzyme, fungi, fungal, substr... | [oleaginous fungi are of special interest amon... |
| 7 | 6 | 108 | 6_breast cancer_recurrence_resection_prognostic | [breast cancer, recurrence, resection, prognos... | [background: no agreement has been made about ... |
| 8 | 7 | 106 | 7_lymphocytes_lymphocyte_cd4_cytokine | [lymphocytes, lymphocyte, cd4, cytokine, cytok... | [neoplastic disorders sometimes accompany a re... |
| 9 | 8 | 105 | 8_left ventricular_heart failure_heart disease... | [left ventricular, heart failure, heart diseas... | [background: we studied the effects of diabete... |
| 10 | 9 | 97 | 9_eye_vision_visual_laser | [eye, vision, visual, laser, anterior, aqueous... | [aims: to quantify the rates of eye preservati... |
Evaluation¶
As in the previous exercise, we use coherence to evaluate our models; there we used UMass. Here we introduce two other coherence metrics, C_V and NPMI. Like UMass, both are based on co-occurrence (words belonging to the same topic should appear together often), but each uses different math to quantify it. We chose UMass last time for convenience: it is fast and needs no external reference corpus. C_V is computationally expensive, but is usually regarded as the most reliable of the three.
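To make "different math" concrete, here is a toy NPMI computation for a single word pair over invented documents. Gensim's c_npmi estimates the probabilities with a boolean sliding window rather than whole documents, so this only shows the core formula:

```python
import math

# NPMI = log(p(x, y) / (p(x) * p(y))) / -log(p(x, y)).
# +1 means the words always co-occur, 0 means independence, -1 means never.
docs = [{"insulin", "glucose"}, {"insulin", "glucose", "diabetic"},
        {"soil", "water"}, {"insulin", "rats"}]

def npmi(x, y, docs):
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    if p_xy == 0:
        return -1.0
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)

print(round(npmi("insulin", "glucose", docs), 3))  # frequently co-occur
print(round(npmi("insulin", "soil", docs), 3))     # never co-occur
```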
from gensim.models.coherencemodel import CoherenceModel
from gensim.corpora import Dictionary
def get_topic_words(topic_model, top_n=20):
"""Extract top-n words per topic from a BERTopic model, excluding outlier topic -1."""
topic_words = []
for topic_id in topic_model.get_topic_freq()["Topic"]:
if topic_id == -1:
continue
words = [w for w, _ in topic_model.get_topic(topic_id)[:top_n]]
if words:
topic_words.append(words)
return topic_words
def tokenize_docs(docs):
"""Simple whitespace tokenizer — replace with your preprocessing if needed."""
return [doc.lower().split() for doc in docs]
def topic_diversity(topic_words, top_n=20):
"""
Proportion of unique words across all topic top-word lists.
Score of 1.0 = all words unique across topics (maximum diversity).
Score approaching 0 = heavy repetition of the same words across topics.
"""
all_words = [w for words in topic_words for w in words[:top_n]]
if not all_words:
return 0.0
return round(len(set(all_words)) / len(all_words), 4)
def compute_coherence(topic_words, tokenized_docs, coherence="c_v"):
"""Compute coherence score for a list of topic word lists."""
dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
cm = CoherenceModel(
topics=topic_words,
texts=tokenized_docs,
dictionary=dictionary,
corpus=corpus,
coherence=coherence,
)
return cm.get_coherence(), cm.get_coherence_per_topic()
def evaluate_model(name, topic_model, docs, top_n=20):
"""Run full evaluation for a single BERTopic model."""
topic_words = get_topic_words(topic_model, top_n=top_n)
tokenized = tokenize_docs(docs)
cv_score, cv_per_topic = compute_coherence(topic_words, tokenized, "c_v")
umass_score, umass_per_topic = compute_coherence(topic_words, tokenized, "u_mass")
npmi_score, npmi_per_topic = compute_coherence(topic_words, tokenized, "c_npmi")
diversity = topic_diversity(topic_words, top_n=top_n)
topic_freq = topic_model.get_topic_freq()
n_topics = len(topic_freq[topic_freq["Topic"] != -1])
noise_docs = topic_freq[topic_freq["Topic"] == -1]["Count"].values
noise_pct = 100 * noise_docs[0] / len(docs) if len(noise_docs) > 0 else 0.0
return {
"model": name,
"n_topics": n_topics,
"noise_pct": round(noise_pct, 2),
"c_v": round(cv_score, 4),
"c_npmi": round(npmi_score, 4),
"c_umass": round(umass_score, 4),
"diversity": diversity,
# Per-topic lists for deeper inspection
"_cv_per_topic": cv_per_topic,
"_npmi_per_topic": npmi_per_topic,
"_topic_words": topic_words,
}
models = {
"Default": topic_model,
"KeyBERT": topic_model_keybert,
}
records = []
per_topic_data = {}
for name, model in models.items():
print(f"Evaluating {name}...")
result = evaluate_model(name, model, abstracts, top_n=20)
per_topic_data[name] = {
"topic_words": result.pop("_topic_words"),
"cv_per_topic": result.pop("_cv_per_topic"),
"npmi_per_topic": result.pop("_npmi_per_topic"),
}
records.append(result)
Evaluating Default...
Evaluating KeyBERT...
df = pd.DataFrame(records).set_index("model")
print("\n── Evaluation Summary ───────────────────────────────────────────────")
print(df.to_string())
── Evaluation Summary ───────────────────────────────────────────────
n_topics noise_pct c_v c_npmi c_umass diversity
model
Default 61 28.26 0.5226 -0.0552 -4.5024 0.7377
KeyBERT 61 28.26 0.5186 -0.1052 -5.2371 0.8000
For C_V, scores closer to 1 indicate better coherence. NPMI ranges from -1 to 1, with higher values indicating better coherence. UMass is negative, with values closer to zero better.
The Default model has better coherence but lower topic diversity. Let's choose it as our model.
First, let's use an LLM to convert the default topic labels into something more intelligible. I'm using MedGemma through Ollama here.
import ollama, json
SYSTEM_PROMPT = '''You are a biomedical expert. Given keywords from a BERTopic topic model trained on PubMed abstracts,
return a JSON object with the key "topic_label" containing a topic label of NO MORE THAN FIVE WORDS.
Example output: {"topic_label": "Cancer Cell Biology"}'''
def label_topic(topic_id, keywords):
kw_str = ', '.join(keywords)
response = ollama.chat(
model = 'MedAIBase/MedGemma1.5:4b',
format = 'json',
options = {'temperature': 0},
messages = [
{'role': 'system', 'content': SYSTEM_PROMPT},
{'role': 'user', 'content': f'Keywords: {kw_str}'}])
result = json.loads(response['message']['content'])
label = result['topic_label'].strip()
words = label.split()
if len(words) > 5:
label = ' '.join(words[:5])
return label
# Extract keywords from BERTopic and label each topic
topic_labels = {}
for topic_id in topic_model.get_topic_freq()["Topic"]:
if topic_id == -1:
continue
keywords = [word for word, _ in topic_model.get_topic(topic_id)[:20]]
label = label_topic(topic_id, keywords)
topic_labels[topic_id] = label
print(f"Topic {topic_id:>3d} : {label}")
# Apply labels to the model
topic_model.set_topic_labels(topic_labels)
Topic 0 : Viral Infection
Topic 1 : Visual Perception in Children
Topic 2 : Medical Imaging
Topic 3 : Water Treatment
Topic 4 : Surgical Complications in Gastric Patients
Topic 5 : Plant Oil Production
Topic 6 : Breast Cancer Survival
Topic 7 : Immune Cell Activation
Topic 8 : Heart Disease
Topic 9 : Visual Surgery Outcomes
Topic 10 : Medical Education
Topic 11 : Mental Health Assessment
Topic 12 : Ovarian Hormone Levels in Pregnancy
Topic 13 : Protein Structure and Function
Topic 14 : Genetic Syndromes
Topic 15 : Hemorrhage Diagnosis
Topic 16 : HIV Drug Resistance
Topic 17 : Health Services Research
Topic 18 : Dental Implant Materials
Topic 19 : Protein Biochemistry
Topic 20 : Lung Function and Oxygenation
Topic 21 : Sexual Health Education
Topic 22 : Kidney Function and Hypertension
Topic 23 : Breast Cancer Screening
Topic 24 : Knee Joint Reconstruction
Topic 25 : Plant Genetics
Topic 26 : Protein Aggregation
Topic 27 : Dietary Fat Intake
Topic 28 : Nitric Oxide Signaling
Topic 29 : Inflammation in Airway Cells
Topic 30 : Liver Metabolism
Topic 31 : Mass Spectrometry Techniques
Topic 32 : Drug Delivery Systems
Topic 33 : Lipid Metabolism in Mice
Topic 34 : Drug Effects on Neurons
Topic 35 : Coronary Artery Disease
Topic 36 : Pain Management Procedures
Topic 37 : Maternal and Infant Health
Topic 38 : Ion Channel Regulation
Topic 39 : Renal Function and Imaging
Topic 40 : Fungal Pulmonary Disease
Topic 41 : Hepatitis B Serology
Topic 42 : Cell Migration and Matrix Interactions
Topic 43 : Breast Cancer Prognosis
Topic 44 : Synaptic Potentials
Topic 45 : Genetic Susceptibility
Topic 46 : Plant Growth and Physiology
Topic 47 : Obesity and Insulin Resistance
Topic 48 : Mitochondrial Evolution
Topic 49 : Chemotherapy Side Effects
Topic 50 : Spinal Cord Injury
Topic 51 : Neurotransmitter Receptor Binding
Topic 52 : Pancreatic Cancer Cell Biology
Topic 53 : Diabetes Mellitus Research
Topic 54 : Exercise Physiology
Topic 55 : Cellular Processes
Topic 56 : Liver Disease Severity
Topic 57 : Catalysis and Synthesis
Topic 58 : Trauma and Emergency Care
Topic 59 : Wound Healing and Tissue Regeneration
Topic 60 : Zinc Metal Complex
Visualization¶
# Visualizing Topics
import plotly.io as pio
pio.renderers.default='notebook'
topic_model.visualize_topics()
# visualizing documents
# reduce dimensionality of embeddings first, speeds up visualization
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
topic_model.visualize_documents(
abstracts,
reduced_embeddings=reduced_embeddings,
topics=list(range(20)),
custom_labels=True)
# Visualizing clustering
topic_model.visualize_hierarchy(custom_labels=True)
# Topic similarity heatmap
topic_model.visualize_heatmap(top_n_topics=20, custom_labels=True)
# Top-n words associated with a topic
topic_model.visualize_barchart(
n_words = 10,
custom_labels = True,
height = 500
)
Extracting Information¶
# find similar topics
similar_topics, similarity = topic_model.find_topics('diabetes', top_n=5)
output = topic_model.get_topic(similar_topics[0])
print(output)
[('insulin', np.float64(0.07505057041840049)), ('glucose', np.float64(0.06449658395331292)), ('diabetic', np.float64(0.05262164962008938)), ('rats', np.float64(0.046001592272389244)), ('beta', np.float64(0.03388939750848574)), ('pancreas', np.float64(0.02949886994259258)), ('secretion', np.float64(0.024190927374734113)), ('diabetes', np.float64(0.02414189329120224)), ('mice', np.float64(0.023805660928468596)), ('pancreatic', np.float64(0.022874374276756034))]
# find all representations of a topic
output1 = topic_model.get_topic(15, full=True)
print(output1)
{'Main': [('patients', np.float64(0.027917486608856703)), ('diagnosis', np.float64(0.026417612300799875)), ('bleeding', np.float64(0.024901275791185475)), ('therapy', np.float64(0.015319996802860718)), ('hemorrhage', np.float64(0.015217886572923122)), ('cranial', np.float64(0.014582847977053592)), ('ci', np.float64(0.013687877858091198)), ('cases', np.float64(0.01330872942611377)), ('risk', np.float64(0.01309258588671222)), ('case', np.float64(0.012850121239623225))]}
# find distributions of topics in documents
# first, get the distributions for all documents in the corpus;
# approximate_distribution slides a token window of size `window` over each
# document with step `stride` and scores each window against the topics
topic_distr, _ = topic_model.approximate_distribution(abstracts, window=8, stride=4, use_embedding_model=True)
print(topic_distr)
[[0.         0.00631404 0.         ... 0.2550849  0.         0.        ]
 [0.         0.04006515 0.03722432 ... 0.05020912 0.         0.        ]
 [0.         0.         0.00963105 ... 0.01278466 0.         0.        ]
 ...
 [0.         0.         0.         ... 0.00644452 0.         0.        ]
 [0.         0.00488603 0.02214933 ... 0.08309011 0.         0.        ]
 [0.01363206 0.00864306 0.00398407 ... 0.0155118  0.00516646 0.        ]]
# then select a document to visualize
import plotly.graph_objects as go
# make sure to input probabilities for a single document.
output2 = topic_model.visualize_distribution(topic_distr[0], custom_labels=True)
fig = go.Figure(output2)
fig.show()
# for reference here is the corresponding abstract
abstracts[0]
'we reviewed the patterns of injuries sustained by consecutive fallers and jumpers in whom primary impact was onto the feet. the fall heights ranged from to ft. the patients sustained significant injuries. skeletal injuries were most frequent and included lower extremity fractures, four pelvic fractures, and nine spinal fractures. in two patients, paraplegia resulted. genitourinary tract injuries included bladder hematoma, renal artery transection, and renal contusion. thoracic injuries included rib fractures, pneumothorax, and hemothorax. secondary impact resulted in several craniofacial and upper extremity injuries. chronic neurologic disability and prolonged morbidity were common. one patient died; the patient who fell ft survived. after initial stabilization, survival is possible after falls or jumps from heights as great as feet it is important to recognize the skeletal and internal organs at risk from high-magnitude vertical forces.'